This report explores a dataset containing information about different characteristics of red wines. The data contains 1599 observations of 13 variables.
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
A summary of the dataset gives a quick overview.
##
## 3 4 5 6 7 8
## 10 53 681 638 199 18
The objective of this dataset is to measure the quality of red wine, which is assessed based on sensory data (median of at least 3 evaluations made by wine experts). The scale ranges from 0 (worst) to 10 (best). The histogram and the table of the quality ratings above show that most wines are rated in the middle of the scale (5 or 6), while ratings at the very low end (0 to 2) and the very high end (9 and 10) are missing completely.
The input variables that could possibly influence the quality of the red wines are fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, and alcohol. Histograms of all 11 variables are shown above to get an overview.
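As a quick check of how strongly the ratings concentrate in the middle, the counts from the table can be summed directly. This is a minimal sketch in base R; the variable names are my own choice for illustration and are not taken from the original analysis code.

```r
# Counts per quality rating, copied from the table printed above.
quality_counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)

# Total number of wines; this should equal the 1599 observations in the dataset.
n_wines <- sum(quality_counts)

# Share of wines rated 5 or 6 (the middle of the observed range).
share_middle <- sum(quality_counts[c("5", "6")]) / n_wines

print(n_wines)
print(round(share_middle, 3))
```

The share of wines rated 5 or 6 comes out at a little over 80%, which is why the later sections group the ratings into buckets.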
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
It is a bit surprising (for me) that the alcohol content ranges from 8.4 to 14.9%, so I will look at the distribution in more detail.
The alcohol content varies considerably. Most wines have alcohol contents around 10% (mean and median are indicated by blue and red lines, respectively), but the distribution is right-skewed and there are many red wines with much higher alcohol contents. I wonder if this quantity is related to the quality of the wine.
##
## 0.9 1.2 1.3 1.4 1.5 1.6 1.65 1.7 1.75 1.8 1.9 2 2.05 2.1 2.15
## 2 8 5 35 30 58 2 76 2 129 117 156 2 128 2
## 2.2 2.25 2.3 2.35 2.4 2.5 2.55 2.6 2.65 2.7 2.8 2.85 2.9 2.95 3
## 131 1 109 1 86 84 1 79 1 39 49 1 24 1 25
## 3.1 3.2 3.3 3.4 3.45 3.5 3.6 3.65 3.7 3.75 3.8 3.9 4 4.1 4.2
## 7 15 11 15 1 2 8 1 4 1 8 6 11 6 5
## 4.25 4.3 4.4 4.5 4.6 4.65 4.7 4.8 5 5.1 5.15 5.2 5.4 5.5 5.6
## 1 8 4 4 6 2 1 3 1 5 1 3 1 8 6
## 5.7 5.8 5.9 6 6.1 6.2 6.3 6.4 6.55 6.6 6.7 7 7.2 7.3 7.5
## 1 4 3 4 4 3 2 3 2 2 2 1 1 1 1
## 7.8 7.9 8.1 8.3 8.6 8.8 8.9 9 10.7 11 12.9 13.4 13.8 13.9 15.4
## 2 3 2 3 1 2 1 1 1 2 1 1 2 1 2
## 15.5
## 1
It seems that there are quite a few outliers on the higher end of residual sugar, so I will zoom in.
Disregarding the outliers on the higher end, residual sugar is almost normally distributed around a value slightly above 2.
The distribution of fixed acidity is right tailed. Most wines have values around 7-8 g/dm³, but some wines have much higher fixed acidity.
Volatile acidity is almost normally distributed when neglecting the highest 1% of the values (the 99th percentile is indicated by a red line).
##
## 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.1 0.11 0.12 0.13 0.14
## 132 33 50 30 29 20 24 22 33 30 35 15 27 18 21
## 0.15 0.16 0.17 0.18 0.19 0.2 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28 0.29
## 19 9 16 22 21 25 33 27 25 51 27 38 20 19 21
## 0.3 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39 0.4 0.41 0.42 0.43 0.44
## 30 30 32 25 24 13 20 19 14 28 29 16 29 15 23
## 0.45 0.46 0.47 0.48 0.49 0.5 0.51 0.52 0.53 0.54 0.55 0.56 0.57 0.58 0.59
## 22 19 18 23 68 20 13 17 14 13 12 8 9 9 8
## 0.6 0.61 0.62 0.63 0.64 0.65 0.66 0.67 0.68 0.69 0.7 0.71 0.72 0.73 0.74
## 9 2 1 10 9 7 14 2 11 4 2 1 1 3 4
## 0.75 0.76 0.78 0.79 1
## 1 3 1 1 1
The most common value of citric acid is 0 g/dm³ (132 wines), although the median is 0.26 g/dm³. The remaining wines are spread over higher values, and there is one outlier with 1 g/dm³ citric acid.
Most red wines contain very little sodium chloride (between 0.05 and 0.10 g/dm³). Without zooming in, however, the dataset also contains a very small number of wines with salt levels of up to 0.6 g/dm³.
Free sulfur dioxide prevents microbial growth and the oxidation of wine. Most red wines have levels below 40 mg/dm³, but the mode of the distribution is found to be much smaller (around 5 mg/dm³).
In low concentrations total sulfur dioxide should not influence the quality of the wine, because “in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine” (from the data documentation). In the above plot the limit of 50 ppm (=50 mg/dm³) is indicated by a green line, showing that while many wines in the dataset have values below, some have much higher levels of total sulfur dioxide.
The density of the wines is close to that of pure water (1 g/cm³, indicated by the blue dashed line), but the mean (red solid line) is slightly lower. This is probably due to the alcohol, which is lighter than water. A small number of wines are denser than pure water. These could be the ones that contain a lot of residual sugar.
The pH value describes the acidity of the red wines. The distribution is almost normal with most values between 3.0 and 3.7.
Most wines have relatively low sulphates levels, but there are a few outliers at the higher end, so I zoom in excluding the top 1%.
The red wines dataset contains 1599 observations of 13 variables. The first variable is an id number. The following 11 describe properties of the red wines: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, and alcohol. The last variable is the output, the quality of the red wine. The quality is based on sensory data (median of at least 3 evaluations made by wine experts). The scale ranges from 0 (worst) to 10 (best).
The main feature in this dataset is the quality of the red wines. However, most red wines have been rated with medium marks. Very good and very bad ratings are missing in this data set.
I suspect alcohol will influence the taste of the wine, but also other chemical properties might be important. A bi- or multivariate exploration of the variables is needed to find out.
In the bivariate analysis section I found that, because most ratings are in the medium range, I needed to create quality groups of “bad wines” (ratings of 3 and 4), “medium wines” (ratings of 5 and 6), and “good wines” (ratings of 7 and 8).
There are quite a few outliers on the high end for residual sugar content. Therefore, when looking at this quantity I exclude the top 5%. For other quantities it is also often convenient to ignore the highest and lowest 1%.
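Excluding the top few percent of a variable can be expressed with a quantile cutoff. The following is a minimal sketch in base R, assuming the values are filtered rather than only hidden by axis limits; `trim_top` is a hypothetical helper name and the toy vector is made up for illustration, not taken from the wine data.

```r
# Keep only values at or below the p-th quantile (p = 0.95 drops the top 5%).
# `trim_top` is a hypothetical helper, not a function from the original analysis.
trim_top <- function(x, p = 0.95) {
  cutoff <- quantile(x, probs = p, na.rm = TRUE, names = FALSE)
  x[!is.na(x) & x <= cutoff]
}

# Made-up toy vector standing in for a right-skewed variable such as residual sugar.
toy <- c(1.8, 1.9, 2.0, 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 15.5)

# Drop the top 10% of this small example so that the single extreme value is removed.
trimmed <- trim_top(toy, p = 0.9)

print(max(toy))
print(max(trimmed))
```

Filtering changes the data that later summaries are computed on, whereas restricting only the axis limits of a plot leaves the data untouched. Which of the two is appropriate depends on whether the extreme values should still count towards means and correlations.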
Alcohol content and quality seem to be related (higher alcohol seems to go along with a higher quality), but I am not able to find a clear linear trend. Using techniques to avoid overplotting gives a clearer picture, but still does not show a clear linear relationship. I will have to look at further variables.
At first glance residual sugar does not seem to be strongly related to the quality of the wines. However, most values of residual sugar are below 4 with some very high outliers, so I will remove the top 5%.
Even zooming in does not show me a clear relationship between residual sugar and the quality of the red wines. I will move on to other variables.
From this scatterplot a linear relationship between quality and fixed acidity is not obvious.
There seems to be an increase in quality with a decreasing volatile acidity. This is reasonable, since high values of volatile acidity lead to a vinegar-like taste. I will have to explore this further.
There seems to be a positive relationship between quality and citric acid, but because most data points are in the quality ratings of 5 and 6, it is hard to see a real linear relationship from this plot.
With zooming in (excluding the top 3% and bottom 1% of data points), there seems to be a slight negative relationship between quality and chlorides.
A relationship between free sulfur dioxide and quality cannot be found from this scatterplot.
Excluding the top 1% of data points there is a (very) slight hint of a negative relationship between quality and total sulfur dioxide, but this scatterplot alone is not convincing enough.
A small negative correlation between quality and density can be seen here. This can, however, be related to the fact that density is strongly controlled by the alcohol content (which I suspect and will explore later on) and alcohol is related to the quality.
Even when zooming in (excluding the top and lowest 1%), a relationship between pH and the quality cannot be found from this scatterplot.
When excluding the top 1% of data points, a positive linear relationship between quality and sulphates can be seen in the above scatterplot.
Looking at scatterplots of the other chemical properties vs. quality I can see some positive and some negative relationships but mostly it is hard to say, because most wines received ratings of 5 and 6. I wonder if it might be better to just compare good wines (ratings of 7 and 8) to bad wines (ratings of 3 and 4).
## X fixed.acidity volatile.acidity citric.acid
## Min. : 19.0 Min. : 4.600 Min. :0.2300 Min. :0.0000
## 1st Qu.: 435.0 1st Qu.: 6.800 1st Qu.:0.5650 1st Qu.:0.0200
## Median : 834.0 Median : 7.500 Median :0.6800 Median :0.0800
## Mean : 837.7 Mean : 7.871 Mean :0.7242 Mean :0.1737
## 3rd Qu.:1285.5 3rd Qu.: 8.400 3rd Qu.:0.8825 3rd Qu.:0.2700
## Max. :1522.0 Max. :12.500 Max. :1.5800 Max. :1.0000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 1.200 Min. :0.04500 Min. : 3.00
## 1st Qu.: 1.900 1st Qu.:0.06850 1st Qu.: 5.00
## Median : 2.100 Median :0.08000 Median : 9.00
## Mean : 2.685 Mean :0.09573 Mean :12.06
## 3rd Qu.: 2.950 3rd Qu.:0.09450 3rd Qu.:15.50
## Max. :12.900 Max. :0.61000 Max. :41.00
## total.sulfur.dioxide density pH sulphates
## Min. : 7.00 Min. :0.9934 Min. :2.740 Min. :0.3300
## 1st Qu.: 13.50 1st Qu.:0.9957 1st Qu.:3.300 1st Qu.:0.4950
## Median : 26.00 Median :0.9966 Median :3.380 Median :0.5600
## Mean : 34.44 Mean :0.9967 Mean :3.384 Mean :0.5922
## 3rd Qu.: 48.00 3rd Qu.:0.9977 3rd Qu.:3.500 3rd Qu.:0.6000
## Max. :119.00 Max. :1.0010 Max. :3.900 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.60 1st Qu.:4.000
## Median :10.00 Median :4.000
## Mean :10.22 Mean :3.841
## 3rd Qu.:11.00 3rd Qu.:4.000
## Max. :13.10 Max. :4.000
## [1] 63 13
The above shows the summary for the “bad wines”, i.e. quality ratings of 3 and 4. 63 wines of the original dataset are considered “bad wines”.
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.700 Min. :0.1600 Min. :0.0000
## 1st Qu.: 382.5 1st Qu.: 7.100 1st Qu.:0.4100 1st Qu.:0.0900
## Median : 768.0 Median : 7.800 Median :0.5400 Median :0.2400
## Mean : 793.0 Mean : 8.254 Mean :0.5386 Mean :0.2583
## 3rd Qu.:1219.5 3rd Qu.: 9.100 3rd Qu.:0.6400 3rd Qu.:0.4000
## Max. :1599.0 Max. :15.900 Max. :1.3300 Max. :0.7900
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.03400 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07100 1st Qu.: 8.00
## Median : 2.200 Median :0.08000 Median :14.00
## Mean : 2.504 Mean :0.08897 Mean :16.37
## 3rd Qu.: 2.600 3rd Qu.:0.09100 3rd Qu.:22.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.860 Min. :0.3700
## 1st Qu.: 24.00 1st Qu.:0.9958 1st Qu.:3.210 1st Qu.:0.5400
## Median : 40.00 Median :0.9968 Median :3.310 Median :0.6100
## Mean : 48.95 Mean :0.9969 Mean :3.311 Mean :0.6473
## 3rd Qu.: 65.00 3rd Qu.:0.9979 3rd Qu.:3.400 3rd Qu.:0.7000
## Max. :165.00 Max. :1.0037 Max. :4.010 Max. :1.9800
## alcohol quality
## Min. : 8.40 Min. :5.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.00 Median :5.000
## Mean :10.25 Mean :5.484
## 3rd Qu.:10.90 3rd Qu.:6.000
## Max. :14.90 Max. :6.000
## [1] 1319 13
The above summary is for the “medium wines” with ratings of 5 and 6. 1319 wines belong to this group.
## X fixed.acidity volatile.acidity citric.acid
## Min. : 8.0 Min. : 4.900 Min. :0.1200 Min. :0.0000
## 1st Qu.: 482.0 1st Qu.: 7.400 1st Qu.:0.3000 1st Qu.:0.3000
## Median : 939.0 Median : 8.700 Median :0.3700 Median :0.4000
## Mean : 831.7 Mean : 8.847 Mean :0.4055 Mean :0.3765
## 3rd Qu.:1089.0 3rd Qu.:10.100 3rd Qu.:0.4900 3rd Qu.:0.4900
## Max. :1585.0 Max. :15.600 Max. :0.9150 Max. :0.7600
## residual.sugar chlorides free.sulfur.dioxide
## Min. :1.200 Min. :0.01200 Min. : 3.00
## 1st Qu.:2.000 1st Qu.:0.06200 1st Qu.: 6.00
## Median :2.300 Median :0.07300 Median :11.00
## Mean :2.709 Mean :0.07591 Mean :13.98
## 3rd Qu.:2.700 3rd Qu.:0.08500 3rd Qu.:18.00
## Max. :8.900 Max. :0.35800 Max. :54.00
## total.sulfur.dioxide density pH sulphates
## Min. : 7.00 Min. :0.9906 Min. :2.880 Min. :0.3900
## 1st Qu.: 17.00 1st Qu.:0.9947 1st Qu.:3.200 1st Qu.:0.6500
## Median : 27.00 Median :0.9957 Median :3.270 Median :0.7400
## Mean : 34.89 Mean :0.9960 Mean :3.289 Mean :0.7435
## 3rd Qu.: 43.00 3rd Qu.:0.9973 3rd Qu.:3.380 3rd Qu.:0.8200
## Max. :289.00 Max. :1.0032 Max. :3.780 Max. :1.3600
## alcohol quality
## Min. : 9.20 Min. :7.000
## 1st Qu.:10.80 1st Qu.:7.000
## Median :11.60 Median :7.000
## Mean :11.52 Mean :7.083
## 3rd Qu.:12.20 3rd Qu.:7.000
## Max. :14.00 Max. :8.000
## [1] 217 13
A summary for “good wines” with ratings of 7 or 8 is shown above. 217 red wines of the original sample are considered “good”.
##
## (2,4] (4,6] (6,8]
## 63 1319 217
So far I have created subsets of the original data frame to look at the summaries. For plotting, I have created a new variable “quality.bucket”.
The boxplots reveal that the median of fixed acidity increases from bad to medium to good wines. The IQR also increases, but comparing the spread might not be very meaningful considering the very different sample sizes in the three groups.
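The bucket labels in the table above, (2,4], (4,6] and (6,8], are the kind of interval labels that `cut()` produces. As a minimal sketch, and assuming that the buckets were built with `cut()` and these breaks, the counts can be reproduced in base R from the rating counts alone:

```r
# Rebuild the vector of quality ratings from the counts in the rating table
# (10 wines rated 3, 53 rated 4, and so on); only the ratings are needed here.
quality <- rep(3:8, times = c(10, 53, 681, 638, 199, 18))

# cut() uses right-closed intervals by default, so (2,4] holds ratings 3 and 4,
# (4,6] holds ratings 5 and 6, and (6,8] holds ratings 7 and 8.
quality.bucket <- cut(quality, breaks = c(2, 4, 6, 8))

print(table(quality.bucket))
```

The three counts agree with the sizes of the “bad”, “medium” and “good” subsets summarised above (63, 1319 and 217 wines).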
For volatile acidity the scatterplots already showed a negative linear relationship to quality. The boxplots for the three quality groups confirm this.
While the scatterplot of quality vs. citric acid was hard to read, the boxplots show a clear increase in quality with citric acid.
Even when excluding the top 5%, a relationship between quality and residual sugar can still not be determined by the boxplots.
“Bad” wines have higher chloride levels than “good” wines, but the IQRs overlap almost entirely and the difference is not great. A real relationship cannot be determined from this plot.
Just as the scatterplot, the boxplots do not show a linear relationship between quality and free sulfur dioxide.
The scatterplot gave a hint of a negative linear relationship between quality and total sulfur dioxide, but the boxplots do not confirm this. “Bad” wines have median levels of total sulfur dioxide that are about the same as for “good” wines. Only “medium” wines have higher levels.
“Good” wines have a lower median density than “bad” or “medium” wines, but a linear relationship between quality and density cannot be found from the boxplots.
The boxplots reveal a slight negative relationship between quality and pH, but since the IQRs overlap almost entirely for “medium” and “good” wines, this does not seem to be very meaningful.
The positive linear relationship between quality and sulphates found in the scatterplot can be confirmed by the boxplots.
While alcohol content seems to be very important for the high quality wines (“good” wines have much higher alcohol content), it does not significantly differ between “bad” and “medium” wines.
Now looking at buckets of quality ratings, the picture is much clearer than just from the scatterplots alone.
There seems to be a positive linear relationship between quality and fixed acidity as well as citric acid and sulphates. For alcohol content the grouping into quality groups shows that “bad” and “medium” red wines have on average the same alcohol content (even though medium wines have more outliers in the positive direction), but good wines have a much higher alcohol content. This does not seem to be a strictly linear relationship though.
A negative linear relationship to quality can be found for volatile acidity and the pH value.
The other variables do not show a linear relationship to quality even when grouping quality ratings.
To complete the picture, I will compute Pearson’s correlation coefficients for all variables vs quality.
##
## Pearson's product-moment correlation
##
## data: quality and fixed.acidity
## t = 4.996, df = 1597, p-value = 6.496e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.07548957 0.17202667
## sample estimates:
## cor
## 0.1240516
##
## Pearson's product-moment correlation
##
## data: quality and volatile.acidity
## t = -16.954, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4313210 -0.3482032
## sample estimates:
## cor
## -0.3905578
##
## Pearson's product-moment correlation
##
## data: quality and citric.acid
## t = 9.2875, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1793415 0.2723711
## sample estimates:
## cor
## 0.2263725
##
## Pearson's product-moment correlation
##
## data: quality and residual.sugar
## t = 0.5488, df = 1597, p-value = 0.5832
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.03531327 0.06271056
## sample estimates:
## cor
## 0.01373164
##
## Pearson's product-moment correlation
##
## data: quality and chlorides
## t = -5.1948, df = 1597, p-value = 2.313e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.17681041 -0.08039344
## sample estimates:
## cor
## -0.1289066
##
## Pearson's product-moment correlation
##
## data: quality and free.sulfur.dioxide
## t = -2.0269, df = 1597, p-value = 0.04283
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.099430290 -0.001638987
## sample estimates:
## cor
## -0.05065606
##
## Pearson's product-moment correlation
##
## data: quality and total.sulfur.dioxide
## t = -7.5271, df = 1597, p-value = 8.622e-14
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2320162 -0.1373252
## sample estimates:
## cor
## -0.1851003
##
## Pearson's product-moment correlation
##
## data: quality and density
## t = -7.0997, df = 1597, p-value = 1.875e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2220365 -0.1269870
## sample estimates:
## cor
## -0.1749192
##
## Pearson's product-moment correlation
##
## data: quality and pH
## t = -2.3109, df = 1597, p-value = 0.02096
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.106451268 -0.008734972
## sample estimates:
## cor
## -0.05773139
##
## Pearson's product-moment correlation
##
## data: quality and sulphates
## t = 10.38, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2049011 0.2967610
## sample estimates:
## cor
## 0.2513971
##
## Pearson's product-moment correlation
##
## data: quality and alcohol
## t = 21.639, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4373540 0.5132081
## sample estimates:
## cor
## 0.4761663
The Pearson’s correlation between the quality (now again the original dataset) and the variables sometimes gives a different picture than the plots. For example, from the correlation coefficient it appears that alcohol content has the strongest correlation to the quality of the red wine.
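The t statistics in the output above follow directly from the correlation coefficient and the degrees of freedom via t = r * sqrt(df) / sqrt(1 - r^2), where df = n - 2. The sketch below recomputes two of them in base R as a consistency check; `t_from_r` is a helper name chosen for this sketch.

```r
# t statistic for testing a Pearson correlation r against zero,
# given the degrees of freedom df = n - 2.
# `t_from_r` is a helper name chosen for this sketch.
t_from_r <- function(r, df) {
  r * sqrt(df) / sqrt(1 - r^2)
}

# Correlation coefficients copied from the output above (1599 wines, so df = 1597).
t_alcohol  <- t_from_r(0.4761663, 1597)
t_volatile <- t_from_r(-0.3905578, 1597)

print(round(t_alcohol, 3))
print(round(t_volatile, 3))
```

Both values should come out close to the 21.639 and -16.954 reported above. With 1597 degrees of freedom even small correlations give small p-values, so the size of the coefficient is more informative here than its significance.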
Fixed and volatile acidity do not seem to be linearly related. However, if both are plotted (individually) against citric acid linear relationships can be observed (positive for fixed and negative for volatile acidity).
##
## Pearson's product-moment correlation
##
## data: fixed.acidity and volatile.acidity
## t = -10.589, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.3013681 -0.2097433
## sample estimates:
## cor
## -0.2561309
##
## Pearson's product-moment correlation
##
## data: fixed.acidity and citric.acid
## t = 36.234, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6438839 0.6977493
## sample estimates:
## cor
## 0.6717034
##
## Pearson's product-moment correlation
##
## data: volatile.acidity and citric.acid
## t = -26.489, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.5856550 -0.5174902
## sample estimates:
## cor
## -0.5524957
The Pearson’s correlation coefficients confirm these relationships. However, it appears that fixed and volatile acidity are also negatively linearly related (with a rather weak correlation coefficient, though), even though this is not obvious from the plot.
##
## Pearson's product-moment correlation
##
## data: density and residual.sugar
## t = 15.189, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.3116908 0.3973835
## sample estimates:
## cor
## 0.3552834
##
## Pearson's product-moment correlation
##
## data: density and alcohol
## t = -22.838, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.5322547 -0.4583061
## sample estimates:
## cor
## -0.4961798
The density of red wines is very close to the density of water (1 g/cm³), but decreases with increasing alcohol content and increases with increasing residual sugar content.
Most wines had been rated with a quality mark of 5 or 6, with far fewer wines receiving ratings of 3, 4, 7, or 8. Therefore, to really see a linear relationship, it was necessary to group the ratings of the red wines into three categories and compare boxplots for the three groups.
Some but not all of the variables in this dataset show a clear relationship to the quality of the red wines. Positive linear relationships can be observed for fixed acidity, citric acid and sulphates. Alcohol content has the largest correlation coefficient of all variables (about 0.48), although the boxplots suggest that this relationship is not strictly linear. Residual sugar shows no reliable linear relationship: its correlation coefficient of about 0.01 is not significantly different from zero. Volatile acidity and the pH value have negative linear relationships to the quality of the red wines, with the one for pH being very weak.
The other variables do not show a linear relationship, but that does not mean that they are completely unrelated to the quality of the wine.
The density of the red wines is determined mostly by the alcohol content and the residual sugar. It decreases with increasing alcohol content and increases with increasing residual sugar content.
Fixed and volatile acidity seem to be linearly related to citric acid (and weakly to each other). However, the relationships are not strong enough for one of these variables to substitute for the other two when looking at the quality of the red wines.
According to the plot and the Pearson’s correlation coefficient (about -0.39), volatile acidity has one of the strongest relationships to the quality of the red wine in this dataset. That makes sense, as the description says: “volatile acidity: the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste”.
The correlation coefficient for alcohol was even higher, but here the relationship does not seem to be strictly linear as the difference for “bad” and “medium” wines is very small. Only for “good” wines the alcohol content is much higher.
This plot confirms again that the density of the wines increases with increasing residual sugar and decreasing alcohol. This multivariate plot combines these two facts so that only one plot is necessary instead of two.
In the bivariate analysis I found that volatile acidity and alcohol have the strongest correlation coefficients. The plot above confirms again that both are related to the quality of the red wines. For alcohol, however, the relationship does not seem to be linear.
## # A tibble: 6 x 9
## quality alcohol_mean volatile.acidit… citric.acid_mean fixed.acidity_m…
## <int> <dbl> <dbl> <dbl> <dbl>
## 1 3 9.96 0.884 0.171 8.36
## 2 4 10.3 0.694 0.174 7.78
## 3 5 9.90 0.577 0.244 8.17
## 4 6 10.6 0.497 0.274 8.35
## 5 7 11.5 0.404 0.375 8.87
## 6 8 12.1 0.423 0.391 8.57
## # … with 4 more variables: sulphates_mean <dbl>, pH_mean <dbl>,
## # residual.sugar_mean <dbl>, n <int>
A new dataframe gives me the mean of the important variables grouped by the quality ratings. A count (n) of how many members each group holds is also included.
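The per-rating means can be cross-checked against the overall summary: weighting each group mean by its group size should recover the overall mean. The sketch below does this for alcohol in base R, using the rounded means printed in the tibble and the counts from the rating table earlier in the report; the variable names are chosen for this sketch.

```r
# Mean alcohol content per quality rating (3 to 8), copied from the tibble above.
alcohol_mean <- c(9.96, 10.3, 9.90, 10.6, 11.5, 12.1)

# Number of wines per rating, copied from the rating table earlier in the report.
n <- c(10, 53, 681, 638, 199, 18)

# Weighting each group mean by its group size recovers the overall mean.
overall_alcohol <- weighted.mean(alcohol_mean, w = n)

print(round(overall_alcohol, 2))
```

Up to rounding of the printed group means, the result should agree with the overall mean alcohol content of 10.42 from the first summary.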
Plotting the means of alcohol and volatile acidity for each quality rating together confirms again that the quality increases with increasing alcohol (especially for the higher quality wines) and with decreasing volatile acidity (especially for the lower and medium quality wines).
The scatterplot reveals the positive relationship citric acid has with the quality of the red wines. However, adding the pH value, it appears that pH is more related to citric acid than to the quality of the wines. Looking at the line graphs that show the means for each rating reveals why: for the medium rated wines (which is the biggest group) the pH value does not seem to vary much. Only the rather bad wines differ significantly from the better wines with having a higher pH value.
The positive linear relationship between quality and sulphates as well as fixed acidity can be seen in the above scatterplot (even though it is less obvious for fixed acidity).
The lineplot for the means of each quality group confirms that there is a clear positive linear relationship between the quality and sulphates and a less pronounced but still visible positive relationship between quality and fixed acidity.
The two variables that are most strongly related to the quality of the red wines (in terms of correlation coefficient) are alcohol and volatile acidity. A combined scatterplot confirms these relationships. Even when holding levels of volatile acidity constant, alcohol content is higher in the higher quality wines. However, volatile acidity and alcohol do not seem to be totally unrelated to each other, since there is more often higher alcohol content to be found in wines with lower than in wines with higher volatile acidity.
While citric acid has a relatively strong positive linear relationship with the quality of the red wines, the pH value (which is related to citric acid) shows a much weaker relationship.
Fixed acidity and sulphates are both related to the quality of the wines but when plotting both together the plot is not easy to read anymore, probably because of too weak linear relationships.
Looking at it from different angles, I could still not find a meaningful relationship between residual sugar and the quality of the wines.
The pH value is so strongly related to citric acid that holding citric acid constant, the variation in pH is not very strong and a relationship to the quality cannot be determined anymore.
The above plot is essentially the scatterplot from the bivariate section showing the alcohol content by quality rating. It has been extended with a color scale for the quality groups I created (“bad”, “medium”, and “good” wines) and with a line showing the mean alcohol content for each rating (as in the multivariate plotting section).
This plot shows clearly that wines that receive higher ratings have on average higher alcohol contents. It also shows that most wines received ratings in the middle of the scale, with far fewer ratings for “bad” and “good” wines.
The above plot shows boxplots for volatile acidity, which was found to be next to alcohol the variable which has the strongest influence on the quality of the wines. The boxplots are grouped for the “bad wines” (ratings 3 and 4), “medium wines” (ratings 5 and 6) and “good wines” (ratings 7 and 8). It is essentially the same plot as the one from the bivariate section, but a line showing the mean volatile acidity for each quality rating is added.
Decreasing levels of volatile acidity are related to a higher quality rating. Only for the very high ratings a difference cannot be found anymore. That makes sense since high levels of volatile acidity give wine a vinegar-like taste. If the level of volatile acidity is already relatively low, it should not influence the quality anymore.
Plotting volatile acidity and alcohol together versus the quality of the red wines the bivariate relationships can be confirmed again. Even when holding levels of volatile acidity constant, alcohol content is higher in the higher quality wines. However, volatile acidity and alcohol do not seem to be totally unrelated: there is more often higher alcohol content to be found in wines with lower than in wines with higher volatile acidity.
The red wine dataset contains 11 variables that describe properties of wine as well as a quality rating. I have started out looking at a histogram of the ratings and found that most wines had been rated in the medium (5 or 6) range, without any ratings at very low or very high end.
I have also looked at the 11 variables to see how they are distributed. Trying to find out which properties of the red wines influence the quality the most, I have looked into (mostly linear) relationships. I found that the quality of red wines seems to improve with increasing alcohol, fixed acidity, citric acid, and sulphates and with decreasing volatile acidity and pH value. For residual sugar there was a hint of a positive linear relationship, but too weak to be sure.
I have tried to learn about the chemical properties of red wine to explore this dataset, but this does not substitute real domain knowledge which would have helped me to make this analysis more meaningful and would enable me to ask more interesting questions.
Since most of the wines had ratings of 5 or 6 and only very few of 3 and 4 or 7 and 8 (and no ratings at the lower or upper ends at all), it was not easy to find relationships. I had to use different ways to group the wines into quality groups to work with the data. A more diverse dataset containing data on wines from the entire quality spectrum (ideally evenly distributed) would make the exploration easier and would probably lead to more meaningful results. With such a dataset I would have tried to build a model to predict the quality of the red wines.